Project 5: Travel Package Purchase Prediction

Part 1: Setting up the Project

Background

You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you understand the existing ways of doing business and how to change them for the benefit of the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers purchased the packages.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

However, this time the company wants to harness the available data of existing and potential customers to make the marketing expenditure more efficient.

As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data to provide recommendations to the Policy Maker and Marketing Team, and also build a model to predict which potential customers are going to purchase the newly introduced travel package.

Objective

To predict which customer is more likely to purchase the newly introduced travel package.

Data Dictionary

Customer details:

  1. CustomerID: Unique customer ID
  2. ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
  3. Age: Age of customer
  4. TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
  5. CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3
  6. Occupation: Occupation of customer
  7. Gender: Gender of customer
  8. NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
  9. PreferredPropertyStar: Preferred hotel property rating by customer
  10. MaritalStatus: Marital status of customer
  11. NumberOfTrips: Average number of trips in a year by customer
  12. Passport: The customer has a passport or not (0: No, 1: Yes)
  13. OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
  14. NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
  15. Designation: Designation of the customer in the current organization
  16. MonthlyIncome: Gross monthly income of the customer

Customer interaction data:

  1. PitchSatisfactionScore: Sales pitch satisfaction score
  2. ProductPitched: Product pitched by the salesperson
  3. NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
  4. DurationOfPitch: Duration of the pitch by a salesperson to the customer

Importing the necessary libraries...

Part 2: Initial Exploratory Data Analysis

We can see from the .shape attribute that there are 4888 rows in the dataset. The describe function shows that not all counts equal 4888, which means there must be null values in the dataset. We also see that many of the variables are coded in binary, so there are a lot of zeros. This is acceptable for the categorical columns. It appears that ProdTaken, Passport and OwnCar are in binary format. CityTier, NumberOfPersonVisiting, NumberOfFollowups, PreferredPropertyStar, PitchSatisfactionScore and NumberOfChildrenVisiting are categorical variables. Age, DurationOfPitch and NumberOfTrips appear to be discrete numerical variables, while MonthlyIncome is continuous. CustomerID appears to be a unique identifier. We will have to find out whether there are any 0s in the continuous data and decide whether or not to replace them with different values (i.e., the median or mean).

Checking for missing values and/or duplicates...

There do not appear to be any duplicate rows. At this point, we will not drop any rows, but we may need to if there are rows with important information missing that cannot be easily replaced. Let's look at the datatypes of each column...

There are 20 columns of data and 4888 rows. Three additional columns appeared using .info() that were not displayed by .describe(). This is because they are in object format, meaning they are also categorical variables. Before we make any changes to the dataset, let's create a copy called 'TD' - short for Tourism Data. This will allow us to keep the original dataset in case we need to recall it for some reason, and gives us a shorthand name for easy manipulation.

Part 3: Cleaning the Data

The first step is to remove the columns that do not provide utility. One such column is CustomerID. It is a unique key, but considering that it is simply a numbering of the clients, in order, from 200000 to 204888, we can remove this column and rely on the index to serve the same purpose. This way, Python will not perform useless analysis on this column (see .describe() above, for example).

Next, let's convert all the categorical data columns to 'category' types...

Converting the data type of categorical features to 'category'

cat_cols = ['ProdTaken', 'Passport', 'OwnCar', 'CityTier', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'PitchSatisfactionScore', 'Designation', 'NumberOfChildrenVisiting']
TD[cat_cols] = TD[cat_cols].astype('category')
TD.info()

Our next step is to rename some of the columns. Many of the column names are long and can be shortened for easier EDA...

ProdTaken

We already knew from above that there are no missing values in this variable. We can see there are only two unique values. 0 is the most frequent, at 3968, so there must be 920 1s, or 'yeses'. The variable is binary and answers the question: "Did the customer purchase a package?" (0: No, 1: Yes)

Age

There are 226 missing Age values. We know the mean is 37.6 and the range is between 18 and 61. We will hold off on replacing the missing values right now until we get a better sense of all the data. For instance, there could be rows with many missing values but if we start replacing missing values one at a time with column means or medians, we may not discover this. If a row is missing a lot of important information, it may be best to just drop the row.

Contact

Contact type only has 2 distinct values. There are 25 missing. We will hold off on cleaning missing values until later for reasons mentioned above.

City

We can see there are no missing values in the City variable. We know there are three unique values that represent city tiers. However, there is a problem, because the variable is described as follows: "City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3". The problem lies in the fact that Python does not understand that Tier 1 is more important than Tier 2; it will look at the numerical representation of the tiers and order them from highest to lowest, ranking 3 above 2 above 1. This is the opposite of the order management intends. We shall have to relabel the values so that Python ranks them in importance the way they are meant to be ranked. To avoid confusion, however, we will use the following numbers:

-1 = Tier 3

0 = Tier 2

1 = Tier 1

This will ensure that Tier 1 is ranked higher than Tier 2 which is higher than Tier 3 and hopefully reduce mistakes that could occur if we simply switched 1 with 3 and 3 with 1.
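As a minimal sketch of this relabeling (using a toy stand-in for the TD dataframe, with illustrative values only), a mapping followed by an ordered categorical achieves what we want:

```python
import pandas as pd

# Toy stand-in for the notebook's working copy 'TD' (illustrative values only).
TD = pd.DataFrame({"CityTier": [1, 3, 2, 1, 1]})

# Relabel so the numeric order matches importance: Tier 3 -> -1, Tier 2 -> 0, Tier 1 -> 1.
TD["CityTier"] = TD["CityTier"].map({3: -1, 2: 0, 1: 1})

# Declare an ordered categorical: -1 (Tier 3) < 0 (Tier 2) < 1 (Tier 1).
TD["CityTier"] = pd.Categorical(TD["CityTier"], categories=[-1, 0, 1], ordered=True)
```

Declaring the column ordered means comparisons such as `TD["CityTier"] > 0` behave as intended.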

After the relabeling of the city values, we see that most of the city types are Tier 1, followed by Tier 2 with about half as many, and in a distant third is Tier 3. Most customers live in highly developed urban centres.

Duration

The average pitch lasts about 15.5 minutes but ranges from 5 minutes to 127 minutes. We can see there are 251 missing values, which we will deal with later. There is also a huge difference between the 75% quartile and the maximum. This means there is at least one outlier, which could be a mistake. We will make note of this and examine it more closely in the Univariate Analysis portion.

Occupation

We can see the breakdown of the 4 unique Occupation values. Most customers are salaried or own small businesses. A smaller percentage own larger businesses. There are only 2 Free Lancers; these rows may need to be dropped since they represent such a small part of the sample. We will decide during Univariate Analysis.

Gender

There appear to be no missing values in the Gender variable; however, there are 3 unique labels. This could be correct, depending on what the third label is (e.g., 'prefer not to say'), but let's check to be sure...

The third label is Fe Male. We can assume this is incorrect and should probably be converted to Female. Let's do this now...
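A minimal sketch of the fix (toy values standing in for TD):

```python
import pandas as pd

# Toy stand-in for TD's Gender column (illustrative values only).
TD = pd.DataFrame({"Gender": ["Male", "Female", "Fe Male", "Male"]})

# Collapse the mislabeled 'Fe Male' entries into 'Female'.
TD["Gender"] = TD["Gender"].replace("Fe Male", "Female")
```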

We have reclassified Fe Male as Female. There are no missing values. It appears that there are about 20% more males than females in the group.

Visitors

We can see that there are 5 unique values for Visitors. This represents how many people in total are going on the trip. Most of the data lies between 2 and 4, and there are no missing values. We will examine the shape of the data more closely and make a decision on the small counts of 1 and 5 during Univariate Analysis.

Followups

There are 45 missing values, which we will deal with below. Most of the data falls under 3 and 4 followups. We will check later in Univariate Analysis to see if the data is bell-shaped.

Product

We can see that there are 5 unique product names and that there are no missing values. Let's see what the product names are...

ASSUMPTION:

This variable looks like it should be ordered. We will assume that the order (from least to most important) is as follows: Basic, Standard, Deluxe, Super Deluxe, King. Python will only know that this is an ordered categorical variable if we change the names to correspond to their importance level. We will use the numbers 1 through 5...
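A sketch of the relabeling under that assumption (toy values standing in for TD; the rank order is our assumption from above):

```python
import pandas as pd

# Toy stand-in for TD's Product column (illustrative values only).
TD = pd.DataFrame({"Product": ["Basic", "Deluxe", "King", "Standard", "Super Deluxe"]})

# Assumed importance order: Basic < Standard < Deluxe < Super Deluxe < King.
product_rank = {"Basic": 1, "Standard": 2, "Deluxe": 3, "Super Deluxe": 4, "King": 5}
TD["Product"] = pd.Categorical(TD["Product"].map(product_rank),
                               categories=[1, 2, 3, 4, 5], ordered=True)
```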

We see that there are no missing values. We have also changed the labels so that they form a ranked category now, instead of being just labels. Python would not have understood that they are ordered in importance without this step and would have treated them as independent.

Stars

We can see there are 26 missing values, which we will deal with later. We can also see that there are just three values, ranging from 3 to 5. Leaving them as numerals will ensure Python ranks them, with 5 being the most important, which is proper.

Marital Status

We have zero missing values and 4 unique labels. However, there could be a problem if the labels are not, in fact, unique. For example, are Single and Unmarried synonymous and used interchangeably? Or are they distinct terms that give additional value? Single could mean unattached, while unmarried could mean having a partner but not being married. It could also be argued that Divorced is not significantly different from Single. Before we decide to combine any values, let's perform Univariate Analysis first. If we notice high correlations, etc., we may then decide to transform some of the data.

Trips

We see that there are 140 missing values for Trips. The mean is 3.23 and the range is 1 to 22. Again, there is a big gap between the 75% quartile and the max, suggesting the presence of outliers. We will perform Univariate Analysis before transforming this variable.

Passport

There are no missing values. Passport is a binary variable, answering the question: "Does the customer have a passport?" Almost twice as many customers say yes.

Score

There are no missing values. Most customers gave a satisfaction score of 3, while scores of 5, 1 and 4 seem evenly represented. Fewer customers gave a score of 2.

Car

There are no missing values. This category is binary and asks the question: "Does the customer own a car?" Almost twice as many customers own their own vehicle.

Children

There are 66 missing values. Most customers have 1 child. The range is from 0 to 4 children.

Designation

Income

There are 233 missing values. Monthly income ranges from 1,000 to 98,678. The max is a lot higher than the 75% quartile, so there are bound to be outliers. The max income equates to over $1M per year. This could be correct, especially for customers who own big businesses, but we will try to make sense of the outliers during Univariate and Bivariate Analysis.

Part 4: Univariate Data Analysis

Product Taken

Age

Contact

City

Duration

Occupation

Gender

Visitors

Follow Ups

Product

Stars

Marital Status

Trips

Passport

Score

Car

Children

Designation

Income

There seem to be only a few outliers above the 40,000-per-month mark. Let's look more closely...

There also seem to be two outliers below 10K. Let's look more closely...

Part 5: Bivariate/Multivariate Data Analysis

Numerical vs Numerical

We can start by checking the correlation between the numerical data variables by using .corr and a heatmap function.
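A minimal sketch of this step, using synthetic numbers in place of the real columns (in the notebook, the resulting matrix would be passed to a heatmap function such as seaborn's):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numerical columns of TD (illustrative only).
rng = np.random.default_rng(0)
TD = pd.DataFrame({
    "Age": rng.integers(18, 62, 200),
    "Duration": rng.integers(5, 40, 200),
    "Trips": rng.integers(1, 9, 200),
    "Income": rng.integers(15000, 41000, 200),
})

# Pairwise Pearson correlations between the numerical variables.
corr = TD.corr()

# The heatmap itself would be drawn with, e.g.:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```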

Dependent Variable vs Numerical Variables

ProdTaken is the dependent variable and thus the most relevant to the model. We will look at relationships between this variable and other numerical variables first, and then at any other numerical variable pairs we might find interesting.

Other Numerical vs Numerical Variables of Interest

We would like to check whether there are strong correlations between Income and the following variables: Age, Trips, Passport, Product and City. The last three showed some of the best potential as predictors of whether a customer will make a purchase.

Numerical vs Categorical

Since ProdTaken is the dependent variable (y) we will concentrate on the relationships between ProdTaken and the categorical variables. We will also consider other key relationships.

Numerical vs Categorical - Other variables of interest

Categorical vs Categorical

Part 6: Data Preprocessing

Let's take one more quick look at the missing values in the dataset...

Outliers

So we know we have 8 columns with missing values to deal with. We have also previously flagged outliers which need to be dealt with. Let's deal with the outliers first...

Duration

We know that there were two outliers in Duration. The calls lasted 126 and 127 minutes, respectively. No other call lasted more than 40 minutes. Clearly, there is an error here. Let's look at the individual rows and check the mean Duration...

We know the mean Duration is 15.49 minutes per call and the median is NaN. Let's see if we can find more information about grouped means to help us make a good replacement decision...

We can see that the means of all the variables they share in common round to either 15 or 16 (Duration must be an integer). The overall mean rounds down to 15, as do the means for 5 of the 9 variables. We shall replace the outliers with 15.
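A sketch of the replacement step (toy Duration values standing in for TD):

```python
import pandas as pd

# Toy stand-in for TD's Duration column, including the two implausible pitches.
TD = pd.DataFrame({"Duration": [10.0, 15.0, 126.0, 20.0, 127.0]})

# Replace any pitch longer than 40 minutes with the rounded group mean of 15.
TD.loc[TD["Duration"] > 40, "Duration"] = 15.0
```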

Occupation

We know that there were two entries of Free Lancer in the Occupation data. While not outliers, they do represent a minuscule portion of the data. It would not be good for our model to include this occupation type, so these two rows need to be dropped.
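A sketch of the drop (toy Occupation values standing in for TD):

```python
import pandas as pd

# Toy stand-in for TD's Occupation column (illustrative values only).
TD = pd.DataFrame({"Occupation": ["Salaried", "Free Lancer", "Small Business",
                                  "Free Lancer", "Large Business"]})

# Drop the Free Lancer rows and reset the index.
TD = TD[TD["Occupation"] != "Free Lancer"].reset_index(drop=True)
```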

Trips

We know that there were four outliers in Trips: 19, 20, 21 and 22 trips, respectively. No other customer took more than 8 trips. This is probably an error, since it would require these individuals to go on a trip almost every second week. They are also such a small portion of the dataset that they can safely be corrected.

We know the mean for Trips is 3.24 trips per year and the median is NaN. Let's see if we can find more information about grouped means to help us make a good replacement decision...

We can see that the means of all the variables they share in common round to 3 (Trips must be an integer). Let's look at some other variables that at least two of these customers share...

In all cases we can see that the mean of trips taken by these subgroups is rounded to 3. It makes the most sense to correct these outliers by replacing their values with 3.

Income

We know that there were two outliers in Income: 95,000 and 98,678, respectively. No other customer earned more than 40,000 per month. This is probably an error, since no one else with their job title or occupation earns anywhere near this amount.

We also discovered through EDA that there were two outliers on the left side of the boxplot as well, where income was much lower than 10k...

Missing Values

So far, we have been cleaning the variables in sequential order, as they appear in the columns from left to right. Now, however, we will start with the variable with the fewest missing values and work up. This makes it easier to check whether rows are missing several values; such rows should be dropped...

Contact

Contact has 25 missing values. Let's pull the rows where Contact = NaN and see if there are other missing values in those rows...

Unfortunately, for EVERY instance where Contact = NaN, Duration = NaN AND Income = NaN as well. That's a lot of missing information in these 25 rows. One option is to replace all these instances with means or medians; however, doing so makes the data more and more similar, which can result in multicollinearity, and we risk creating a model that overfits. An alternative is simply to drop all 25 rows. We know this will also deal with almost all of the missing Duration values simultaneously. Let's check how much of the overall data this represents...

Half a percent seems reasonable, especially considering there are three columns of missing data per row. It makes more sense to drop this data than to clean it. Therefore, we will drop it...
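A sketch of the drop and the percentage check (toy rows standing in for TD):

```python
import numpy as np
import pandas as pd

# Toy stand-in: rows with Contact = NaN also miss Duration and Income.
TD = pd.DataFrame({
    "Contact": ["Self Enquiry", np.nan, "Company Invited", np.nan],
    "Duration": [12.0, np.nan, 18.0, np.nan],
    "Income": [25000.0, np.nan, 30000.0, np.nan],
})

# Share of rows about to be dropped, as a percentage of the dataset.
pct_dropped = TD["Contact"].isna().mean() * 100

# Drop every row where Contact is missing.
TD = TD.dropna(subset=["Contact"]).reset_index(drop=True)
```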

Stars

The Stars variable had 26 missing values. Let's pull the rows where Stars = NaN and see if there are other missing values in those rows...

The rows with missing Stars values are otherwise complete. This is good; unlike Contact, we should keep the data. We can also see a pattern or two: MOST of the customers with missing Stars values were Self Enquiry, are VPs, and chose King. We can look at the modes of all three.

We can see that, for all the subgroups, the mode of Stars is 3. We can replace all the nans in Stars with the mode...
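A sketch of the mode imputation (toy Stars values standing in for TD; the real column's mode is 3, as found above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for TD's Stars column (illustrative values only).
TD = pd.DataFrame({"Stars": [3.0, np.nan, 5.0, 3.0, np.nan, 4.0]})

# mode() ignores NaN by default; take the first (most frequent) value.
mode_val = TD["Stars"].mode()[0]
TD["Stars"] = TD["Stars"].fillna(mode_val)
```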

Followups

The Followups variable had 45 missing values. Let's pull the rows where Followups = NaN and see if there are other missing values in those rows...

The other values in the rows where Followups = NaN do not have much in common. There seems to be an even divide between Salaried and Small Business, as well as an even divide between Manager and Executive. We can check the modes of these 4 values to determine the appropriate replacement value.

We can see that, for all the subgroups, the mode of Followups is 4. We can replace all the nans in Followups with the mode...

Children

The Children variable had 66 missing values. Let's pull the rows where Children = NaN and see if there are other missing values in those rows...

There are too many rows to see all the data, but one thing that stands out is that the salaries are all tightly clustered, apparently between 34k and 36k. Let's see if this is true...

As we thought, the range goes from 34,636 to 35,558. It may be better to use the mode of Children for people in this income range...

Unfortunately, there are only 68 customers in this income range, and we know that in 66 of them Children = NaN. This forces us to abandon this idea and replace the NaNs with the column mode.

Trips

The Trips variable had 140 missing values. The number of missing rows is getting too big to examine visually, so let's create a dataframe, as we did with Children, to look for a pattern...

From the sample we selected, we can see that ProdTaken = 0 in every case, Self Enquiry is common, Salaried and Small Business dominate, most select 3 or 4 visitors, Product is either 3 or 4, and VP or AVP is the most popular Designation. The income range is tight, as we have seen from .describe(), going from 35k to 39k. Let's look at some value counts to select a good group representative...

This is probably not a good indicator, because we know 1 represents just under 19% of the total dataset, and here it represents under 9%.

We can see that, for almost all the subgroups, the mode of Trips is 2. We can replace all the NaNs in Trips with the mode...

Designation

We will make the assumption that the values are supposed to be ranked. In any organization, there is a hierarchy of positions: the higher up, the more responsibilities and, generally, the higher the salary. The shape of an organization's hierarchy is usually a triangle, with the fewest people at the top holding most of the decision-making power. Based on frequencies, we already know the triangle SHOULD look like this (lowest rank to highest): Executive, Manager, Senior Manager, AVP, VP. Let's look at the mean salaries of these positions before making a final decision...

We can see that if we choose to rank the designations by salary, we would get the same result. Therefore, we can go ahead and rank our column.

Income

The Income variable had 208 missing values. We know from the EDA analysis that Income is moderately correlated with Age and weakly correlated with Trips. We did not see enough in the categorical variables to make a link between Income and any of them.

We know that the overall median is NaN and that the data is slightly skewed. When data is skewed we should replace NaN values with the median, but that obviously will not work in this case, since the median itself is NaN. One way to overcome this is to determine the median of Income by Age, since we know the two are correlated. This does risk making Income and Age even more correlated, but since the 208 values replaced with the median grouped by Age amount to less than 5% of all Income data, it should not impact the model too significantly.

The table highlights the fact that the data is slightly skewed, since mean = median only under a normal distribution. We also see that at no age is the median NaN. Therefore, we should be able to replace the NaNs with the median of Income grouped by Age.
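A sketch of the group-wise median fill (toy Age/Income values standing in for TD):

```python
import numpy as np
import pandas as pd

# Toy stand-in for TD (illustrative values only).
TD = pd.DataFrame({
    "Age": [25, 25, 25, 40, 40, 40],
    "Income": [20000.0, np.nan, 22000.0, 35000.0, np.nan, 37000.0],
})

# Each missing Income gets the median Income of customers with the same Age.
TD["Income"] = TD.groupby("Age")["Income"].transform(lambda s: s.fillna(s.median()))
```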

Note

The following code was run to randomly select a row where Income = NaN. The aim is to check the age, replace the missing income with the age-group median above, and verify below that it worked. However, restarting the kernel would select a different row. When we ran the code it returned row 452, which we will use and then hard-code so that restarting the kernel does not select a new row.

We can see from the table above that the replacement function worked. However, there may be instances where Age AND Income are both NaN...

Unfortunately, there are 130 instances where Age and Income are both NaN. Let's skip ahead to Age, clean that variable, and then continue replacing the missing Income values with the median grouped by Age.

Age

The Age variable had 226 missing values. We know from the EDA analysis that Age is moderately correlated with Income and weakly correlated with Trips. Age also has a relationship with Designation: as individuals age, they tend to hold more senior positions (AVP and VP).

It appears that the Designation median may be better for our purposes because there is more spread between the 5 types. We also know there are no missing values in the Designation column. Trips only had a weak correlation with Age. Linking Age and Income would not be a good idea because (a) it's circular: we did so above and were left with 130 missing values, and (b) it would increase the correlation between the two variables (i.e., if we took the column median for both and filled the NaNs).

We can now go back and replace the NaNs in Income using the median grouped by Age...
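The same group-wise pattern works for Age, using the Designation medians (toy values standing in for TD):

```python
import numpy as np
import pandas as pd

# Toy stand-in for TD (illustrative values only).
TD = pd.DataFrame({
    "Designation": ["Executive", "Executive", "Executive", "VP", "VP"],
    "Age": [26.0, 30.0, np.nan, 52.0, np.nan],
})

# Each missing Age gets the median Age of customers with the same Designation.
TD["Age"] = TD.groupby("Designation")["Age"].transform(lambda s: s.fillna(s.median()))
```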

Duration

The Duration variable had 226 missing values. We know from the EDA analysis that Duration was not correlated with any numerical variable, and we couldn't find any strong links between Duration and any categorical variable either. It is perhaps best, then, that we simply replace the missing values with either the median or the mode. We know from the EDA analysis that the data is skewed to the right, so the median is the more appropriate choice here.

Marital Status

The last variable we need to look at is marital status. We know there are four distinct types but there may be redundancies here. Is Unmarried and Single synonymous? Just how different is Divorced from Single or Unmarried? Lets explore...

Surprisingly, the pair of categories that resemble each other most closely (in mean Age, Duration, Income, Trips, and the percentage who said yes to the product being sold) is married and divorced individuals. We had hoped that Unmarried and Single would be similar enough to combine, but this is not the case. It seems more prudent to leave all 4 types in place for now.

Converting 'object' datatype columns into categorical variables

Part 7 - Setting up the Model

Split the data into train and test sets

Before building the model, let's create functions to calculate the different metrics (Accuracy, Recall and Precision) and plot the confusion matrix.
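A sketch of such helper functions (the function names here are our own, not necessarily the notebook's), shown on toy data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

def model_performance(model, X, y):
    """Return Accuracy, Recall and Precision for a fitted classifier."""
    pred = model.predict(X)
    return {
        "Accuracy": accuracy_score(y, pred),
        "Recall": recall_score(y, pred),
        "Precision": precision_score(y, pred),
    }

# Tiny illustration on perfectly separable toy data.
X = np.array([[0], [1], [0], [1]])
y = np.array([0, 1, 0, 1])
clf = DecisionTreeClassifier(random_state=1).fit(X, y)
metrics = model_performance(clf, X, y)
cm = confusion_matrix(y, clf.predict(X))  # would be passed to a heatmap for plotting
```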

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting a customer will purchase a travel product when the customer ultimately does not.
  2. Predicting a customer will not purchase a travel package when, in fact, the customer would have purchased one had they been contacted.

Which case is more important?

  1. If the model predicts a customer will purchase a travel product but the customer does not buy one, the company wastes resources on marketing spending. The company has indicated that marketing costs are quite high and is actively seeking to lower these costs.
  2. If the model predicts customer will not purchase a product, when in fact, the customer would purchase a product, the company is losing business. This represents an opportunity cost. It threatens profits and can even jeopardize the long-term financial health of the company.

Which metric to optimize?

Decision Tree Classifier

Hyperparameter Tuning

Part 8 - Ensemble Techniques: Bagging

Bagging Classifier

Observations

Random Forest Classifier

Observations

Hyperparameter Tuning

Bagging Classifier

Some of the important hyperparameters available for bagging classifier are:

Let's check different metrics for bagging classifier with best hyperparameters and build a confusion matrix.
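A hedged sketch of the tuning step (synthetic data and illustrative grid values; the notebook's actual grid and scoring choice may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the real train split.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# Illustrative grid over a few common BaggingClassifier hyperparameters.
param_grid = {
    "n_estimators": [10, 30],
    "max_samples": [0.7, 1.0],
    "max_features": [0.7, 1.0],
}

grid = GridSearchCV(BaggingClassifier(random_state=1), param_grid,
                    scoring="recall", cv=3)
grid.fit(X, y)
best_bagging = grid.best_estimator_  # refit on the full data with the best combination
```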

Observations

Using logistic regression as the base estimator for bagging classifier:

Observations

Random Forest Classifier

We will try to improve the model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are:

Observations

Using class_weights for random forest:

Observations

Comparing all Bagging Models

Observations

Part 9 - Ensemble Techniques: Boosting

AdaBoost Classifier

Hyperparameter Tuning

Gradient Boosting Classifier

Hyperparameter Tuning

XGBoost Classifier

Hyperparameter Tuning

Note

XGBoost with tuning showed the most promise, so its hyperparameters were experimented with at length. The above code contains the final values only, due to the time it takes to run. However, the following values were tried in various combinations...

"n_estimators": [10, 30, 50, 75, 100, 150, 200],
"scale_pos_weight": [1, 2, 3, 4, 5, 6, 7, 8],
"subsample": [0.7, 0.9, 1],
"learning_rate": [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0],
"gamma": [0, 1, 3],
"colsample_bytree": [0.5, 0.7, 0.9, 1],
"colsample_bylevel": [0.5, 0.7, 0.9, 1],
"colsample_bynode": [0.5, 0.7, 0.9, 1]

Comparing all Boosting Models

Part 10 - Ensemble Techniques: Stacking

Stacking Classifier

Part 11 - Ensemble Techniques: Choosing the Final Model

Feature Importances

XG Boost

Business Insights and Recommendations

Appendix - Customer Profile of Different Packages

Basic

Basic Product: Customer Profile Summary

Standard

Standard Product: Customer Profile Summary

Deluxe

Deluxe Product: Customer Profile Summary

Super Deluxe

Super Deluxe Product: Customer Profile Summary

King

King Product: Customer Profile Summary